Many popular measures used in social network analysis, including measures of centrality, are
based on the random walk. The random walk is a model of a stochastic process where a node 
interacts with one other node at a time. However, the random walk may not be appropriate for 
modeling social phenomena, including epidemics and information diffusion, in which one node may 
interact with many others at the same time, for example, by broadcasting the virus or information 
to its neighbors. To produce meaningful results, social network analysis algorithms have to take into 
account the nature of interactions between the nodes. In this paper we classify dynamical processes 
as conservative and non-conservative and relate them to well-known measures of centrality used in 
network analysis: PageRank and Alpha-Centrality. We demonstrate, by ranking users in online 
social networks used for broadcasting information, that non-conservative Alpha-Centrality leads to 
a better agreement with an empirical ranking scheme than the conservative PageRank. 




I. INTRODUCTION 



Social network analysis algorithms examine topology 
of a network in order to find interesting structure within 
it. It has been recognized recently, however, that net- 
work structure is the product of both its links and the 
dynamical processes taking place on the network, which 
determine how ideas, pathogens, or influence flow along 
social links [TH5]. Borgatti [TJ [S], for example, argued 
that a node's centrality, a measure often used to identify 
important or influential actors in a social network, gives 
a summary of its participation in the flow taking place 
on the network. An appropriate centrality for a given 
network, therefore, is one whose assumptions match the 
details of the flow. Some of the best-known measures 
of centrality, such as PageRank [3] and its variants [5], 
are based on random walk-like phenomena [5, 7]. A random walk
on a graph is a stochastic process that starts
at some node, and at each time step transitions to a 
randomly selected neighbor of the current node. Vari- 
ants of the random walk are often used to model flows in 
physical systems, e.g., chemical and heat diffusion, and 
can be used to model social phenomena resulting from 
one-to-one interactions, such as Web surfing or phone 
conversations. Random walks, however, do not model 
many phenomena of interest to social scientists, such as 
adoption of innovation [9], spread of epidemics [10, 11]
and word-of-mouth recommendations [12], viral marketing
campaigns [13, 14], growth of social movements [15],
and information diffusion [16]. These phenomena are
usually modeled as an epidemic process, where rather 
than choosing one neighbor, an activated or "infected" 
node will attempt to activate all its neighbors. For example,
on the social media site Twitter users broadcast their
posts, called tweets, to all their followers. Similarly, in an 
epidemic, an infectious person will pass the virus to all 
susceptible contacts. Therefore, unlike the random walk, 
which conserves the amount of the diffusing substance, 
epidemic processes are fundamentally non-conservative. 

This paper makes two contributions. First, we 
classify dynamical processes as conservative and non- 
conservative and study their relationship to two well- 
known centrality measures: PageRank and Alpha- 
Centrality. PageRank [4], originally used in Google's successful
search engine, gives the steady state distribution
of a conservative dynamical process (specifically, random
walk with random restarts [Hill]). Alpha-Centrality [17]
measures the number of paths of any length between two
nodes, exponentially attenuated by parameter $\alpha$, so that
longer paths contribute less to centrality than shorter
paths. We demonstrate that Alpha-Centrality gives the
steady state distribution of a class of non-conservative
dynamical processes as long as $\alpha$ is bounded by the inverse of
the largest eigenvalue of the adjacency matrix of the
graph. This quantity, called the epidemic threshold, governs
the behavior of many non-conservative processes in
networks, for example, the spread of a virus along social
links [THl US]. When the effective transmissibility of the
virus is below this threshold, it will die out [111 HO], but
above the threshold it will reach a finite fraction of all
nodes, resulting in an epidemic. Our analysis provides
an intuitive explanation for the location of the epidemic 
threshold, and a further demonstration of the fundamen- 
tal connection between network structure and dynamics. 

The second contribution of the paper is an empirical 
study of the ability of PageRank and Alpha-Centrality 
to identify influential social media users. Specifically, we 
study the online social networks of the social news aggre- 
gator Digg and the microblogging service Twitter, both 
of which are used by people to share news stories and
other content with their followers. The spread of information
is often modeled as an epidemic process [2"TH2~i];
hence it has a non-conservative flavor. We define two empirical
measures of influence based on user activity, and
rank users according to these measures. We show that 
non-conservative Alpha-Centrality generally leads to a 
better agreement with the activity-based rankings than 
conservative PageRank. While the effect of dynamical 
processes on centrality was studied theoretically and in 
simulation [1], our work provides an empirical demonstration
that the choice of centrality impacts our ability to
identify important people in real-world social networks.

This paper is organized as follows. In Section II we provide
a description of conservative and non-conservative
dynamical processes and demonstrate, in Section III,
that Alpha-Centrality gives the steady state distribution
of a non-conservative dynamical process, for example,
a spreading epidemic. Then, in Section IV we compare
Alpha-Centrality to PageRank on the task of identifying
influential social media users and show that Alpha-Centrality
gives a better agreement with empirical measures
of influence. We conclude with a summary of related work.



II. CLASSIFICATION OF DYNAMICAL 
PROCESSES 

We represent a network by a directed graph $G = (V, E)$
with $V$ nodes and $E$ edges. The adjacency matrix of the
graph is defined as: $A[u, v] = 1$ if $(u, v) \in E$; otherwise,
$A[u, v] = 0$. Also, $A[u, u] = 0$. The set of out-neighbors
of $u$ is $\{v \in V \mid (u, v) \in E\}$, and the set of in-neighbors
is $\{v \in V \mid (v, u) \in E\}$. Another important quantity is
the diagonal out-degree matrix $D$, which is defined as
$D[i, i] = \sum_j A[i, j] = (A e^T)[i]$ and $D[i, j] = 0 \;\forall\, i \neq j$. Here,
$e$ is a $|V|$-dimensional row vector of ones, and $e^T$ is its
transpose.
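The definitions above can be made concrete with a small sketch. The four-node edge list below is hypothetical and chosen only for illustration; plain Python lists stand in for the matrices $A$ and $D$.

```python
# Hypothetical 4-node directed graph, used only for illustration.
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
n = 4

# Adjacency matrix: A[u][v] = 1 if (u, v) is an edge, otherwise 0.
A = [[0] * n for _ in range(n)]
for u, v in edges:
    A[u][v] = 1

# The out-degree of node i is the i-th row sum of A; D is diagonal.
out_degree = [sum(row) for row in A]
D = [[out_degree[i] if i == j else 0 for j in range(n)] for i in range(n)]

# Neighbor sets read directly off the rows and columns of A.
out_neighbors = {u: [v for v in range(n) if A[u][v]] for u in range(n)}
in_neighbors = {u: [v for v in range(n) if A[v][u]] for u in range(n)}
```

In this toy graph node 2 has three in-neighbors (0, 1 and 3) but only one out-neighbor, which is the kind of asymmetry the in-degree and out-degree distinction is meant to capture.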

A dynamical process is mediated by interactions between
nodes, which can be thought to distribute some
quantity, or weight, on a network. Let the $|V|$-dimensional
vector $x$ represent the weight of each node at
time $t$. A dynamical process is described mathematically
by a function $F_t(x)$ that maps the weight vector at time
$t$ to the weight vector at time $t + 1$.



A. Conservative Processes 

A stochastic process is conservative if it simply redistributes
the weights among the nodes of the graph, with
the total weight remaining constant: $\|x\|_1 = \|F_t^c(x)\|_1$,
where $\|\cdot\|_1$ represents the $L_1$-norm of the argument, i.e.,
$\|x\|_1 = \sum_i |x_i|$.

To give an intuition for the mathematical formulation
of conservative processes, imagine a society where nodes
interact by redistributing money among themselves, and
the money cannot be created or destroyed. Let $x^c(t)$
be the amount of money each node has, and $\Delta(t)$ the
amount it receives, at time $t$. Suppose that at each time
step a node retains a fraction $(1 - \alpha)$ of the amount it
received in the previous step and redistributes the rest
among its neighbors. Let transfer matrix $T[p, q]$ represent
the fraction of the amount transferred by node $p$ to
$q$. Therefore, the amount of money nodes receive at time
$t + 1$ is $\Delta(t + 1) = \alpha \Delta(t) T$. The transfer matrix encodes
the rules of interaction. If each member divides $\alpha \Delta(t)$
equally amongst her out-neighbors, then $T = D^{-1} A$.

Step by step, the conservative process looks as follows. Initially,
the amount each node receives is $\Delta(0) = x^c(0)$.
At time $t = 1$ each node keeps $(1 - \alpha)$ of that amount
and divides the rest among its out-neighbors, who receive
$\Delta(1) = \alpha \Delta(0) T = \alpha x^c(0) T$. At time $t = 2$, each node
retains $(1 - \alpha)$ of the amount it received from in-neighbors
at $t = 1$, and divides the rest among its out-neighbors,
who receive $\Delta(2) = \alpha \Delta(1) T = \alpha^2 x^c(0) T^2$, and so on.
The total weight (or amount of money) the nodes have
at time $t$, $x^c(t)$, is the amount they retained from all
previous time steps and the amount they received from
in-neighbors at time $t$:

\[
\begin{aligned}
x^c(t) &= (1 - \alpha) \sum_{k=0}^{t-1} \Delta(k) + \Delta(t) \\
       &= \sum_{k=0}^{t-1} (1 - \alpha)\, \alpha^k x^c(0) T^k + \alpha^t x^c(0) T^t \\
       &= (1 - \alpha)\, x^c(0) + \alpha\, x^c(t-1)\, T. \qquad (1)
\end{aligned}
\]

As $t \to \infty$, this equation reduces to

\[
\begin{aligned}
x^c(t \to \infty) &= (1 - \alpha)\, x^c(0) + \alpha\, x^c(t \to \infty)\, T \\
                  &= (1 - \alpha)\, x^c(0)\, (I - \alpha T)^{-1}. \qquad (2)
\end{aligned}
\]
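A minimal numerical sketch of the recursion in Eq. (1) may help. It assumes a small hypothetical graph in which every node has at least one out-neighbor (so that $T = D^{-1}A$ is well defined), a uniform starting vector, and $\alpha = 0.85$; none of these choices come from the paper.

```python
def vecmat(x, M):
    """Row vector times matrix: (xM)[j] = sum_i x[i] * M[i][j]."""
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]

# Hypothetical graph; every row has at least one nonzero entry.
A = [[0, 1, 1, 0],
     [0, 0, 1, 0],
     [1, 0, 0, 0],
     [0, 0, 1, 0]]
n = len(A)

# Transfer matrix T = D^{-1} A: each row of A divided by its out-degree.
T = [[A[i][j] / sum(A[i]) for j in range(n)] for i in range(n)]

alpha = 0.85              # assumed value, for illustration only
x0 = [1.0 / n] * n        # uniform initial weight, total weight 1

# Iterate x(t) = (1 - alpha) x(0) + alpha x(t-1) T and track the total weight.
x = x0[:]
totals = []
for _ in range(200):
    x = [(1 - alpha) * a + alpha * b for a, b in zip(x0, vecmat(x, T))]
    totals.append(sum(x))

# At the steady state, x should satisfy the fixed-point form of Eq. (2).
step = vecmat(x, T)
residual = max(abs(x[j] - ((1 - alpha) * x0[j] + alpha * step[j])) for j in range(n))
```

The list `totals` stays at 1 throughout, which is the conservation property, and `residual` shrinks toward zero as the iteration approaches the steady state.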

The transfer matrix $T$ is a stochastic matrix, since
its rows sum up to 1. If, instead of distributing evenly
among neighbors, each node decided to keep a portion $\delta$
for itself, this variant of a conservative process would be
governed by the transfer matrix:

\[
T = \delta I + (1 - \delta) D^{-1} A. \qquad (3)
\]



Random walk on a graph is a prototypical conservative
process, since the probabilities of finding the walker at the
nodes of the graph always sum to one. There exist many flavors
of random walk. One of them is the widely studied random
walk with random restarts [H [71 [5S], which can be
described mathematically as follows. Let the initial probability
to find the walker on any node be uniform, i.e.,
$x^c(0) = e/|V|$. At any time, with probability $\alpha$ the walker
at node $p$ randomly chooses one of the out-neighbors of
$p$ and jumps to it. With probability $(1 - \alpha)$, it randomly
chooses any node on the graph and jumps to it. Let matrix
$S$ encode the probability of jumping to any node,
$S[p, q] = 1/|V|$, and $T = D^{-1} A$. Then the probability of
finding the random walker at node $q$ at time $t$ is given by

\[
\begin{aligned}
x^c(t) &= (1 - \alpha)\, x^c(t-1)\, S + \alpha\, x^c(t-1)\, T \\
       &= (1 - \alpha)\, x^c(0) + \alpha\, x^c(t-1)\, T,
\end{aligned}
\]

where the second line follows because the entries of $x^c(t-1)$
sum to one, so that $x^c(t-1) S = e/|V| = x^c(0)$. This is
exactly the same as Eq. (1).



B. Non-Conservative Processes 

A stochastic process where the total weight can change
over time is non-conservative: $\|x\|_1 \neq \|F_t^n(x)\|_1$. To
illustrate the difference between conservative and non-conservative
processes, we return to our hypothetical society.
Again, imagine that each node has some amount of
money; however, it also has a money minting machine, so
that instead of dividing the money it receives among its
out-neighbors, it can give each neighbor the same amount
by printing extra as needed.

Let $\Delta(t)$ represent the amount of money each node receives
at time $t$. At the next time step, each node gives a
fraction $\alpha$ of this amount to each of its out-neighbors,
printing extra as needed. The additional amount it
produces can be expressed using the replication matrix
$\mathcal{R} = A$. Therefore, $\Delta(t + 1) = \alpha \Delta(t) \mathcal{R}$. Initially, let
$\Delta(0) = x^n(0)$. At time $t = 1$, each node prints $\alpha \Delta(0)$ for
each out-neighbor: $\Delta(1) = \alpha \Delta(0) \mathcal{R} = \alpha x^n(0) \mathcal{R}$. Continuing
this process, the additional amount out-neighbors receive
at time $t$ is $\Delta(t) = \alpha \Delta(t-1) \mathcal{R} = \alpha^t x^n(0) \mathcal{R}^t$. The
total amount each node has at time $t$ is obtained by summing
what it received from in-neighbors at previous time
steps:

\[
\begin{aligned}
x^n(t) &= \sum_{k=0}^{t} \Delta(k) = \sum_{k=0}^{t} x^n(0)\, (\alpha \mathcal{R})^k \\
       &= x^n(0) + \alpha\, x^n(t-1)\, \mathcal{R}. \qquad (4)
\end{aligned}
\]

At time $t \to \infty$, Eq. (4) reduces to

\[
x^n(t \to \infty) = x^n(0) \sum_{k=0}^{t \to \infty} (\alpha \mathcal{R})^k, \qquad (5)
\]

which can be solved to yield

\[
\begin{aligned}
x^n(t \to \infty) &= x^n(0) + x^n(t \to \infty)\, (\alpha \mathcal{R}) \\
                  &= x^n(0)\, (I - \alpha \mathcal{R})^{-1}. \qquad (6)
\end{aligned}
\]

This expression is defined for $\alpha < 1/\lambda_1$, where $\lambda_1$ is the
largest eigenvalue of $\mathcal{R}$.
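The contrast with the conservative case can be sketched numerically by iterating Eq. (4) with $\mathcal{R} = A$. The graph, the starting weights and both values of $\alpha$ below are illustrative assumptions; for this particular graph the largest eigenvalue is about 1.32, so 0.5 lies below the bound $1/\lambda_1$ and 0.9 lies above it.

```python
def vecmat(x, M):
    """Row vector times matrix: (xM)[j] = sum_i x[i] * M[i][j]."""
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]

def nonconservative(x0, R, alpha, steps):
    """Iterate x(t) = x(0) + alpha * x(t-1) R, starting from x(0)."""
    x = x0[:]
    for _ in range(steps):
        x = [a + alpha * b for a, b in zip(x0, vecmat(x, R))]
    return x

# Hypothetical graph; its largest eigenvalue is roughly 1.32.
A = [[0, 1, 1, 0],
     [0, 0, 1, 0],
     [1, 0, 0, 0],
     [0, 0, 1, 0]]
x0 = [1.0, 1.0, 1.0, 1.0]

# Below the bound the total weight grows but settles to a finite value.
x_sub = nonconservative(x0, A, 0.5, 200)
step = vecmat(x_sub, A)
residual = max(abs(x_sub[j] - (x0[j] + 0.5 * step[j])) for j in range(len(x0)))

# Above the bound the same recursion keeps growing without limit.
x_super = nonconservative(x0, A, 0.9, 200)
```

Unlike the conservative process, `sum(x_sub)` exceeds `sum(x0)`: weight is created along the way. The run with $\alpha = 0.9$ illustrates why Eq. (6) is only defined below $1/\lambda_1$.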

More generally, if along with producing $\alpha$ of what it
receives from each in-neighbor, a node also produces a
portion $\delta$ of this amount for itself, this leads to a more
general form of the replication matrix:

\[
\mathcal{R} = \frac{\delta}{\alpha} I + A. \qquad (7)
\]



1. Non-Conservative Dynamics and Epidemic Threshold 

Non-conservative processes provide a useful framework 
for thinking about epidemics and other contact processes 
and lead to insights into the relation between dynami- 
cal processes and network structure. Consider a virus 
spreading on a network, where at each time step, a contagious
node may infect its susceptible neighbors with
probability $\mu$ (virus birth rate). At each time step, an
infected node may also be cured with probability $\beta$ (virus
curing rate). Wang et al. [19] modified existing models of
SIS dynamics [26] for use on networks. The probability
$p_{i,t}$ that node $i$ is infected at time $t$ can be written in
matrix notation as [19]:

\[
P_t = P_{t-1} \big( (1 - \beta) I + \mu A \big) = P_0 \big( (1 - \beta) I + \mu A \big)^t, \qquad (8)
\]

where $P_t$ is a vector $(p_{1,t}, p_{2,t}, \ldots)$, and $P_0$ is the initial
probability of infection. [52] $P_t$ is exactly equal to the additional
weight, $\Delta(t)$, accrued by a non-conservative process
in Eq. (4) with $\mathcal{R} = \frac{1 - \beta}{\mu} I + A$ and $\alpha = \mu$. Therefore,
an SIS-type epidemic is an example of a non-conservative
dynamic process.

In the model in Eq. (8) there exists a threshold $\mu_c$ such
that when the effective transmissibility of the virus $\mu/\beta < \mu_c$,
it will die out, and for $\mu/\beta > \mu_c$ it will spread to a significant
portion of the network. For any network, regardless
of the details of the spreading mechanism [20],
this threshold is given by the inverse of the largest eigenvalue
$\lambda_1$ of the adjacency matrix $A$, $\mu_c = 1/|\lambda_1|$ [19], where
$|\lambda_1|$ is known as the spectral radius of the graph. In numerical
experiments we simulated epidemics on different graphs
using the independent cascade model [23]. We found that
the observed threshold where epidemics began to reach
many nodes was consistent with the inverse of the spectral
radius of the respective graph.
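As a rough sketch of how such a threshold can be estimated for a given graph, the largest eigenvalue can be approximated by power iteration. This is a standard numerical technique and not necessarily the procedure used by the authors; the two test graphs are hypothetical.

```python
def largest_eigenvalue(A, iters=500):
    """Approximate the largest eigenvalue of a non-negative matrix
    by power iteration, rescaling to unit max-norm at every step."""
    n = len(A)
    v = [1.0] * n
    lam = 0.0
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = max(abs(c) for c in w)
        if lam == 0.0:
            return 0.0          # no cycles: every long path count vanishes
        v = [c / lam for c in w]
    return lam

# Complete graph on four nodes: the largest eigenvalue is exactly 3.
K4 = [[0 if i == j else 1 for j in range(4)] for i in range(4)]
lam_k4 = largest_eigenvalue(K4)
threshold_k4 = 1.0 / lam_k4

# Small hypothetical directed graph; its characteristic polynomial
# is lambda * (lambda^3 - lambda - 1).
A = [[0, 1, 1, 0],
     [0, 0, 1, 0],
     [1, 0, 0, 0],
     [0, 0, 1, 0]]
lam_a = largest_eigenvalue(A)
threshold_a = 1.0 / lam_a
```

Power iteration can fail to converge on bipartite graphs, where the largest and smallest eigenvalues have equal magnitude, so a more robust eigenvalue routine would be needed for general inputs.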

Threshold behavior appears to be a generic property 
of non-conservative dynamics. As shown in the Appendix,
the expected path length of a non-conservative
process, i.e., how far the process spreads as $t \to \infty$,
is $\ell = (1 - \alpha \lambda_1)^{-1}$ for $\alpha < 1/|\lambda_1|$ and $\ell \sim O(t)$ for
$\alpha > 1/|\lambda_1|$. Therefore, the expected path length $\ell$ diverges
as $\alpha$ approaches $1/|\lambda_1|$ from below. This is a hallmark
of critical behavior. For non-conservative processes, the
critical behavior is associated with the epidemic threshold,
below which the non-conservative process reaches
very few nodes, but above which it reaches a significant
fraction of all nodes.

There is another way to think about thresholds. 
Among epidemiologists, the principal quantity of interest
is the reproductive number, $R_0$ [27]. Intuitively, this
quantity is just the average number of new infections caused
by a single infected person. If $R_0 > 1$, each infection
creates new infections indefinitely, and results in an epidemic,
while for $R_0 < 1$, the disease eventually dies out.
Naively, the reproductive number should just be the average
degree times the transmissibility, or contagiousness,
of the virus. For the Digg follower graph, for example, the
average degree $\langle k \rangle \approx 6$, so $R_0 \approx 6\lambda$, where $\lambda$ is the
transmissibility of the virus. In that case, an epidemic threshold
at $R_0 = 1$ implies $\lambda_c \approx 1/6$, much higher than we observed
in simulations of an SIR epidemic (using the independent
cascade model) on the Digg follower graph [53]. While a
heterogeneous degree distribution (a common property of
social networks) can lower the threshold compared to this
prediction [28], this computation is not simple, making
the basic reproductive number less useful in characterizing
epidemics in social networks.
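The gap between the naive estimate and the spectral threshold can be illustrated on a toy graph with one hub. The wheel graph below is a hypothetical stand-in, not the Digg follower graph, and the comparison assumes, as stated in Section II B 1, that the threshold is the inverse of the largest eigenvalue.

```python
def wheel_adjacency(m):
    """Undirected wheel: hub 0 joined to rim nodes 1..m, which form a cycle."""
    n = m + 1
    A = [[0] * n for _ in range(n)]
    for i in range(1, n):
        nxt = i % m + 1                  # next rim node around the cycle
        A[0][i] = A[i][0] = 1            # spoke between hub and rim node
        A[i][nxt] = A[nxt][i] = 1        # rim edge
    return A

def largest_eigenvalue(A, iters=500):
    """Power iteration with max-norm rescaling (non-negative matrices)."""
    n = len(A)
    v = [1.0] * n
    lam = 0.0
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = max(abs(c) for c in w)
        v = [c / lam for c in w]
    return lam

A = wheel_adjacency(15)
n = len(A)

mean_degree = sum(sum(row) for row in A) / n    # average degree <k>
naive_threshold = 1.0 / mean_degree             # from R_0 = <k> * lambda = 1

lam1 = largest_eigenvalue(A)
spectral_threshold = 1.0 / lam1                 # inverse of largest eigenvalue
```

For this wheel the average degree is 3.75 while the largest eigenvalue is 5, so the spectral threshold (0.2) sits below the naive one (about 0.27). The hub is what pulls the threshold down, in line with the remark about heterogeneous degree distributions.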



III. DYNAMICAL PROCESSES AND 
CENTRALITY 

The complex interplay between network structure and 
dynamics has broad implications for social network anal- 
ysis. Take the task of identifying influential or prestigious 
actors in a social network. Over the years many differ- 
ent centrality measures have been developed to address 
this task, including degree centrality, betweenness centrality
[H], eigenvector centrality [30], PageRank [3] and
Alpha-Centrality [17], among many others. Applied to
the same network, however, each measure leads to a dif- 
ferent, even conflicting notion, of who the central actors 
are. In order to make sense of the scores produced by 
each centrality measure, it is important to consider the 
nature of the dynamical process on the network. 



A. Centrality Measures 

We study PageRank and Alpha-Centrality, two widely 
used measures of centrality, and show their relationship 
to conservative and non-conservative processes. 

a. PageRank A PageRank vector $pr_\alpha(s)$ is the
steady state probability distribution of a random walk
with restarts with a damping factor $\alpha$ (restart probability
$1 - \alpha$). The starting vector $s$ gives the probability
distribution for where the walk transitions to after
restarting. The transfer matrix encodes the transition
probabilities of a random walk on the network,
$W = D^{-1} A$. PageRank vector $pr_\alpha(s)$ is the unique solution of:

\[
pr_\alpha(s) = (1 - \alpha)\, s + \alpha\, pr_\alpha(s)\, W. \qquad (9)
\]

Equation (9) is identical to the steady state solution of
the linear conservative dynamic process given by Eq. (2),
where $W = T = D^{-1} A$ and $s = x^c(0)$. Therefore,
PageRank is the steady state solution of a conservative
process, and it is a conservative measure. Other measures
derived from the random walk, such as betweenness
centrality, are also conservative.

b. Alpha-Centrality Alpha-Centrality measures the
total number of paths from a node, exponentially attenuated
by their length. Bonacich introduced this measure
[17] as a generalization of the index of status proposed
by Katz [31], and it is sometimes referred to as
Bonacich centrality. It is also similar to the communicability
index recently explored by the physics community [32].
For an attenuation parameter $\alpha$, Alpha-Centrality
vector $cr_\alpha(s)$ is the solution of:

\[
cr_\alpha(s) = s + \alpha\, cr_\alpha(s)\, A, \qquad (10)
\]

where the starting vector $s$ is taken as in-degree centrality,
$s = eA$ [33], with $e$ a row vector of ones. Equation (10)
holds while $|\alpha| < 1/|\lambda_1|$, the inverse of the spectral radius of
the network. This bound, in fact, is the same as the epidemic
threshold (Section II B 1). For positive values, parameter
$\alpha$ determines how far, on average, a node's effect will be
felt and sets the length scale of interactions. [53] When
$\alpha$ is small, Alpha-Centrality probes only the local structure
of the network. As $\alpha$ grows, more distant nodes
contribute to the centrality score of a given node [34].
As $\alpha \to 1/\lambda_1$, the length scale of interactions diverges
(Sec. II B 1) and it becomes a global measure.

One difficulty in using Alpha-Centrality is that it is
not defined for $\alpha > 1/\lambda_1$. We recently introduced
normalized Alpha-Centrality that overcomes this problem [34].
It normalizes the score of each node by the
sum of the Alpha-Centrality scores of all the nodes. The
new measure avoids the problem of bounded parameters
while retaining the desirable characteristics of Alpha-Centrality,
namely its ability to differentiate between local
and global structures. Normalized Alpha-Centrality
$ncr_\alpha(s)$ is written as:

\[
ncr_\alpha(s) = \frac{1}{\|cr_\alpha(s)\|_1}\, cr_\alpha(s). \qquad (11)
\]

This is defined for $0 < \alpha < 1$ ($\alpha \neq 1/|\lambda_1|$). This value
changes with $\alpha$ for $\alpha < 1/\lambda_1$. For $\alpha > 1/\lambda_1$, normalized
Alpha-Centrality is independent of $\alpha$ and the ordering
found by normalized Alpha-Centrality in this parameter
range is equivalent to the ordering found by eigenvector
centrality [35].
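For $\alpha$ below the bound, Eqs. (10) and (11) can be evaluated by simple fixed-point iteration. The sketch below assumes a small hypothetical graph and $\alpha = 0.1$, and uses the in-degree starting vector $s = eA$ described above; it does not cover the regime $\alpha > 1/\lambda_1$, where this iteration diverges.

```python
def alpha_centrality(A, alpha, iters=500):
    """Iterate cr = s + alpha * cr * A with s = eA (in-degree).
    Converges only when alpha is below 1 / (largest eigenvalue of A)."""
    n = len(A)
    s = [sum(A[i][j] for i in range(n)) for j in range(n)]   # column sums
    cr = s[:]
    for _ in range(iters):
        cr = [s[j] + alpha * sum(cr[i] * A[i][j] for i in range(n))
              for j in range(n)]
    return cr

# Hypothetical graph: nodes 0 and 1 both have in-degree 1,
# node 2 has in-degree 3, node 3 has in-degree 0.
A = [[0, 1, 1, 0],
     [0, 0, 1, 0],
     [1, 0, 0, 0],
     [0, 0, 1, 0]]

cr = alpha_centrality(A, 0.1)

# Normalized Alpha-Centrality: divide by the sum of all scores.
total = sum(cr)
ncr = [c / total for c in cr]

ranking = sorted(range(len(cr)), key=lambda j: -cr[j])
```

Nodes 0 and 1 have the same in-degree, yet node 0 scores higher because its single in-neighbor (node 2) is itself well connected. This is the sense in which larger $\alpha$ lets more distant structure contribute to a node's score. Normalization rescales the scores but leaves the ranking unchanged.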

Alpha-Centrality and its normalized version are equivalent
to Eq. (5) with the initial distribution of weight given
by $x^n(0) = c \cdot s$, where $c = 1$ for Alpha-Centrality and
$c = 1/\|cr_\alpha(s)\|_1$ (cf. Eq. (11))
for normalized Alpha-Centrality. Note that we use notation
$\|M\|_1 = \sum_{i,j} M[i, j]$ for any matrix $M$. Therefore,
(normalized) Alpha-Centrality is the steady state
solution of a non-conservative dynamic process. Variations
of non-conservative dynamics lead to other non-conservative
measures, such as degree centrality, Katz index [31],
SenderRank [36], and eigenvector centrality [30].



B. Choosing Appropriate Centrality Measure 

When applied to the same network, different measures 
of centrality may lead to different, often incompatible, 
views of who central actors are. The natural question
to ask is: Which centrality measure is appropriate for a
given network? The choice of centrality must be motivated
by details of the dynamical process taking place
on the network [1]. Thus, a conservative measure such as
PageRank is appropriate for analyzing networks on which 
conservative processes, such as web surfing or money ex- 
change, are taking place. However, for a social network 
on which information or epidemics are spreading, a non- 
conservative measure, such as Alpha-Centrality, may be 
more appropriate. 



IV. EMPIRICAL STUDY OF CENTRALITY 

In this section we use social media data to evaluate the 
claim that the measure that best identifies central nodes is 
one that captures details of the dynamical process taking 
place on the network. Social media sites such as Face- 
book, Twitter, and Digg have become important hubs 
of social activity and conduits of information. Correctly 
identifying central or influential users in these networks 
can have far-reaching consequences for identifying note- 
worthy content, targeted information dissemination, and 
other applications. While a variety of methods [37-40]
have been used to identify influential social media users, 
each measure produces different results, with no clear 
understanding of when it is appropriate. Fortunately, 
by exposing user activity, social media provides a rare 
opportunity to study the role of dynamic processes on 
networks. 

Both Digg and Twitter allow users to create social net- 
works by listing others as friends. The friend relationship 
is asymmetric. When user A lists B as a friend (A → B),
A follows B's activity, but not vice versa. We call A the
follower of B (or fan on Digg). When the follower graph is
represented in matrix form, a user's indegree measures 
the number of followers she has, and her outdegree the 
number of friends she follows. 

By submitting a story to Digg (or tweeting a URL to 
a story on Twitter), a user broadcasts it to her followers. 
When another user votes for the story, she re-broadcasts 
it to her own followers. Broadcast-driven information 
diffusion has a non-conservative flavor; therefore, a non- 
conservative centrality measure should better identify in- 
fluential users. 

We analyzed information diffusion on the follower 
graphs of Digg and Twitter and used this data to construct
an empirical estimate of user influence. We then
evaluated how well different centrality measures agree with
the empirical measure of influence.



A. Data Sets 

The Digg dataset [54] contains more than 3 million
votes on some 3500 stories promoted to Digg's front page
in June 2009. More than 139K distinct users voted for at
least one story in the data set (submission counts as the
story's first vote). We call these users active users. Next,
we extracted the friendship links created by active users
and constructed a follower graph that contained active
users who were following the activities of others. However,
only about 71K active users listed others as friends,
resulting in a network with around 300K users and over 1
million links.

The Twitter data set was collected over the period 
of three weeks in October 2010 using the Gardenhose 
streaming API. We focused on tweets that included a 
URL in the body of the message, usually shortened by 
some URL shortening service, such as bit.ly or tinyurl. 
In order to ensure that we had the complete retweeting 
history of each URL, we used Twitter's search API to re- 
trieve all tweets containing that URL. Users who tweeted 
the URL are considered active. The data collection process
resulted in more than 3 million posts tweeted by 816K
users which mentioned 70K distinct shortened URLs.
Next, we used the REST API to collect followers of each
active user, keeping only those followers who themselves
were active, i.e., tweeted at least one URL during the data
collection period. The resulting follower graph had almost
most 700K nodes and over 36 million edges. While fil- 
tering out non-active followers will change results of cen- 
trality calculations, we argue that this is an appropriate 
simplification to make, both conceptually and to keep 
the graph of a computationally manageable size. We ar- 
gue that inactive users do not contribute to information 
spread, and should not be considered in calculations of 
centrality. 

While voting on Digg represents pure information diffusion
(in contrast to Twitter, a Digg user can vote only
once for a story), tweeting activity in our sample encompassed
diverse behaviors from pure information diffusion
of newsworthy content to orchestrated manipulation
campaigns, robo-tweeting, advertising and spam.
Since our analysis applies only to information diffusion-type
behavior, we have to filter out the latter activities. We
used a method described in [41] to automatically clas- 
sify tweeting behaviors using two information theoretic 
features. The first feature is the entropy of the distri- 
bution of distinct users who re-tweeted the URL. The 
second feature is the entropy of the distribution of time 
intervals between successive re-tweets of the same URL. 
We showed that these two features alone were able to 
accurately separate re-tweeting activity into meaningful 
classes. High user entropy implies that many different 
people re-tweeted the URL, with most people re-tweeting 
it once. High time interval entropy implies the presence of
many different time scales, which is a characteristic of
human activity. In contrast, low time interval entropy
implies that the URL is retweeted at one or a few regular time
intervals, which is characteristic of automated (possibly
spam) activity. In this paper, we focus on those URLs
from the data set which are characterized by high (> 3)
user and time interval entropies. These parameter values
are associated with the spread of newsworthy content
and exclude robotic spamming and manipulation campaigns
driven by a few individuals.
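The two features can be sketched as Shannon entropies of empirical distributions. The details here are assumptions, since the text does not give them: a base-2 logarithm, and time intervals treated as discrete values (whole seconds). The method in [41] may bin or normalize differently.

```python
from collections import Counter
from math import log2

def entropy(items):
    """Shannon entropy (in bits, an assumed unit) of the empirical
    distribution of the values in `items`."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

def interval_entropy(timestamps):
    """Entropy of the gaps between successive retweets.
    `timestamps` are whole seconds in chronological order (assumed format)."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return entropy(gaps)

# Hypothetical URL retweeted once each by 16 different users.
organic_users = ["user%d" % i for i in range(16)]
# Hypothetical URL retweeted 16 times by the same account.
robot_users = ["bot"] * 16

# Hypothetical robot posting at a perfectly regular 60-second interval.
robot_times = [60 * i for i in range(17)]

h_organic = entropy(organic_users)
h_robot = entropy(robot_users)
h_robot_time = interval_entropy(robot_times)
```

With these toy inputs the organic URL has user entropy 4 bits and would pass a "> 3" cut, while the single-account URL has zero user entropy and zero time-interval entropy.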



B. Empirical Estimates of Influence 

Katz and Lazarsfeld [32] defined influentials as "individuals
who were likely to influence other persons in
their immediate environment." In the years that followed,
many attempts were made to identify people who
influenced others to adopt a new practice or product [43].
The rise of online social networks has allowed researchers 
to trace the flow of information through social links on a 
massive scale. Using the new empirical foundation, some 
researchers proposed to measure a person's influence in 
social media locally, by the number of votes or retweets 
from followers her posts generate [35] 00], or globally, 
by the size of cascades her posts trigger [T3J [3D] . Alter- 
natively, Trusov et al. [33] defined influential people in 
an online social network as those whose activity stimu- 
lates those connected to them to increase their activity, 
while Cha et al. [37] used the total number of retweets 
and mentions, including from people not connected ei- 
ther directly or indirectly to the submitter, to measure 
user influence on Twitter. 

Following these works, we measure influence by ana- 
lyzing user activity in social media. Suppose that a user 
posts new information on Digg or Twitter, specifically, a 
URL to a news story. We refer to this user as the story's 
submitter. Whether or not a follower will re-broadcast
the story (i.e., retweet it on Twitter or vote for it on
Digg) depends on its quality and the submitter's influence.
We assume that a story's quality is uncorrelated with the
submitter. [55] Therefore, we can average out its effect by
aggregating over all stories submitted by the same user.
We claim that the residual difference between submitters 
can be attributed to variations in influence. We use two 
empirical measures of submitter's influence: (i) the aver- 
age number of times her submissions are re-broadcast by 
her followers (local influence [40]), and (ii) average size 
of the cascades her posts trigger (global influence [40]). 



1. Measuring local influence on Digg 

To reduce the effect of the front page to which Digg
promotes popular stories, we count the number of votes
from the submitter's followers within the first 100 votes only.
Since few stories are promoted to the front page before
they receive that many votes, this ensures social links are
mainly responsible for spreading interest in stories [To].
Of the 3552 stories in the Digg data set, 3489 were submitted
by 572 connected users. Of these, 289 distinct
users submitted two or more stories which received at
least one follower vote within the first 100 votes, providing
us with enough information to estimate influence.
Figure 1(a) shows the average number of follower votes
$\langle k \rangle$ within the first 100 votes received by stories submitted
by these users versus the number of followers $K$ these
users have.

FIG. 1: Analysis of the empirical estimate of influence on
Digg and Twitter. (a, c) The scatter plot shows the average
number of times followers rebroadcast a story within its first
100 rebroadcasts vs. the number of followers the submitter
has. Each point represents a distinct submitter. (b, d) Probability
of the expected number of follower rebroadcasts being
generated purely by chance.

Are these observations significant? Do submitters with
more followers simply get more votes due to greater numbers
of followers? Or could we have observed that many
follower votes purely by chance? Let's assume that there
are $N$ users who vote for stories randomly, independently
of who submits them. This type of stochastic voting can
be described by the urn model [45]. Imagine an urn that
contains $N$ balls, of which $K$ are white. Imagine also
that we draw $n$ balls from the urn without replacing
them. How many of them will be white? The probability
that $k$ of the first $n$ votes come from the submitter's
followers purely by chance is equivalent to the probability
that $k$ of the $n$ balls drawn from the urn are white. This
probability is given by the hypergeometric distribution:

\[
P(X = k \mid K, N, n) = \frac{\binom{K}{k} \binom{N - K}{n - k}}{\binom{N}{n}}. \qquad (12)
\]

Using Eq. (12) we compute the probability
$P(X = \langle k \rangle \mid K, N, n)$ ($N = 71367$, $n = 100$) that a story submitted by a
Digg user with $K$ followers received $\langle k \rangle$ votes from the
submitter's followers purely by chance. As shown in
Figure 1(b), for $K > 100$, this probability is very small;
therefore, it is unlikely (P < 0.05) that these votes could arise
purely by chance. We conclude that the average number of
follower votes received by stories submitted by a user
(with at least 100 followers) is a statistically significant
(P < 0.05) measure of her influence.
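The hypergeometric probability is straightforward to evaluate with exact integer binomials. In the sketch below, $N = 71367$ and $n = 100$ are taken from the text, while the follower count $K = 1000$ and the observed count $k = 10$ are hypothetical values chosen for illustration.

```python
from math import comb

def hypergeom_pmf(k, K, N, n):
    """P(X = k | K, N, n): probability that exactly k of n balls drawn
    without replacement are white, when K of the N balls are white."""
    if k < 0 or k > n or k > K:
        return 0.0
    # Python integers are exact, so the huge binomials do not overflow;
    # only the final ratio is converted to a float.
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

N, n = 71367, 100          # values quoted in the text
K, k = 1000, 10            # hypothetical submitter: 1000 followers, 10 follower votes

p_chance = hypergeom_pmf(k, K, N, n)
expected_by_chance = n * K / N     # mean of the hypergeometric distribution
```

Under random voting this hypothetical submitter would expect about 1.4 follower votes among the first 100, so observing 10 has a probability far below 0.05.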



2. Measuring local influence on Twitter 

We analyzed the Twitter data set using the same 
methodology. There were 174 users who posted at least 
two URLs that were retweeted at least 100 times. Figure 1(c)
shows the average number of times the posts
of these users were retweeted by their followers. Figure 1(d)
shows the probability that this number of retweets
could have been observed purely by chance. Since these
values are small, we conclude that the average number of
follower retweets is a statistically significant (P < 0.05)
estimate of influence on Twitter.



3. Measuring global influence 

Alternatively, we can measure the influence of the sub- 
mitter by the average size of the cascades her posts trig- 
ger. A cascade describes how information spreads on a 
follower graph. The cascade begins with a seed, e.g., the
story submitter, who broadcasts the story to her followers.
It grows when these followers choose to vote for or
retweet the story, in turn broadcasting it to their own
followers, and so on. All nodes in a cascade are connected
to the seed through follower relations, either directly or
indirectly through other nodes in the cascade.

For each post, we extracted the cascade that starts
with the submitter and includes all voters/retweeters who
are connected to nodes in the cascade via follower relations.
The larger the average cascade size, the more influential
the submitter.
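The cascade extraction described above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used on the data: we assume that votes are processed in chronological order, that the seed counts toward the cascade size, and that the follower graph is stored as a dictionary; the names extract_cascade, average_cascade_size and following are hypothetical.

```python
def extract_cascade(submitter, voters, following):
    """Return the set of users in the cascade seeded by the submitter.

    voters:    user ids in chronological order of voting (an assumption).
    following: dict mapping a user id to the set of users that user follows.
    """
    cascade = {submitter}
    for voter in voters:
        if voter in cascade:
            continue
        # A voter joins the cascade if she follows someone already in it.
        if following.get(voter, set()) & cascade:
            cascade.add(voter)
    return cascade


def average_cascade_size(posts, following):
    """Mean cascade size over a list of (submitter, voters) pairs."""
    sizes = [len(extract_cascade(s, v, following)) for s, v in posts]
    return sum(sizes) / len(sizes)


# Hypothetical toy data: b follows a, c follows b, d follows x, e follows c and d.
following = {"b": {"a"}, "c": {"b"}, "d": {"x"}, "e": {"c", "d"}}
cascade = extract_cascade("a", ["b", "d", "c", "e"], following)
print(sorted(cascade))
```

In this toy example user d votes but is not connected to the seed through follower relations, so she is left out of the cascade.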



C. Comparison of Centrality Measures 

We use empirical estimates of influence to rank a subset 
of users in the Digg and Twitter data sets who submit- 
ted more than one story (URL) which received at least 
100 votes (retweets). We evaluate centrality measures 
by comparing the rankings of users they produce to the 
empirical rankings. 

We studied standard PageRank (with uniform start- 
ing vector) and Alpha-Centrality (with in-degree as the 
starting vector), both of which were computed on the 
follower graph. The effect of using other starting vectors 
for PageRank (as is done in personalized PageRank) and 
Alpha-Centrality is the subject of future work. We use
Pearson's correlation coefficient (since ties in rank may 
exist) to compare the rankings produced by the two cen- 
trality measures with the empirical activity-based rank- 
ing. 
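One way to compare two rankings that may contain ties is to assign tied users the mean of the ranks they span and then compute Pearson's correlation coefficient on the resulting ranks. The sketch below assumes this reading of the comparison; the function names are hypothetical, and a constant ranking (zero variance) is rejected rather than handled.

```python
from math import sqrt


def average_ranks(scores):
    """Rank scores in descending order (rank 1 = highest score); tied
    scores share the mean of the ranks they span."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant ranking")
    return cov / (sx * sy)


def rank_correlation(centrality, empirical):
    """Correlation between the rankings induced by two score lists."""
    return pearson(average_ranks(centrality), average_ranks(empirical))
```

For instance, average_ranks([10, 20, 20, 5]) gives the two tied users the shared rank 1.5.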



Figure 2 shows the correlation between local influence
(empirical ranking based on the average number
of follower rebroadcasts) and the rankings produced by
Alpha-Centrality and PageRank on Digg and Twitter. Parameter
α stands for the attenuation factor for (normalized)
Alpha-Centrality (see Equations 10 and 11) and the
damping factor (restart probability = 1 − α) for PageRank
(see the PageRank equation above). Except for values of α
close to 1, Alpha-Centrality correlates better with the
empirical estimates of influence than PageRank on Digg
(Figure 2(a)). Of the
174 Twitter users who posted at least two URLs that
were retweeted more than 100 times, only 75 could be
classified as non-spammers according to the entropy criteria
[H] mentioned above. Figure 2(b) shows the correlation
of the local empirical measure of influence of
these users with PageRank and Alpha-Centrality.
Alpha-Centrality outperforms PageRank for all values of α.

The inset shows the interval corresponding to small
values of α. Note that normalized Alpha-Centrality becomes
a global metric very quickly, i.e., over a small range
of α values. The point at which it becomes constant
corresponds to the epidemic threshold. There are interesting
differences in the behavior of the correlation with the
empirical measure of influence on Digg and Twitter. On Digg,
the correlation with Alpha-Centrality grows from α = 0,
suggesting that global structure becomes more important
in determining influence, while on Twitter it has the
opposite behavior. These differences could arise from
differences in the network structure, and will be studied in
our future research.

Figure 3 shows the correlation of the global empirical
estimate of influence (average cascade size) with
Alpha-Centrality and PageRank for the same sets of submitters.
Though the correlation with global influence on Digg is
lower overall than for local influence, Alpha-Centrality
outperforms PageRank for all values of α. Surprisingly, the
correlations on Twitter are negative. This is consistent with
the observations of Bakshy et al. [ID], who found that the
cascade size of past submissions was not a good predictor of
the cascade size of a user's future submissions on Twitter.
The correlations are less negative for PageRank, but
it is difficult to conclude anything about the relative
performance of Alpha-Centrality and PageRank from these
results.

The empirical results, for the most part, support our
claim that Alpha-Centrality is better able to identify
important users than PageRank because it more closely
models the dynamic process taking place on the follower
graph: the spread of information through broadcasts
from users to their followers. PageRank outperforms
Alpha-Centrality on Digg for α > 0.85. Incidentally,
α = 0.85 was the value originally suggested by Brin
et al. [3] for finding important pages in a Web graph.
Empirical studies suggest different values of α are
appropriate for different domains [46], although some authors
caution [25] against using values of α close to one. Since
it is not clear what value of α would be appropriate for
social networks, Alpha-Centrality's better overall performance
suggests that it is better suited for identifying
influential users on Digg. Alpha-Centrality performs better
than PageRank on Twitter for all values of α.

FIG. 2: Correlation between the rankings produced by the local empirical measure of influence (average number of follower

Results of correlation of centrality with the global
measure of influence are less conclusive. While
Alpha-Centrality does correlate better than PageRank with this
measure on Digg for all values of α, on Twitter these
measures are anti-correlated. One possible explanation could
be differences in the user interface on these sites. Another
possibility is that information spread cannot be modeled
as a simple epidemic [EZ], and these differences are
more pronounced on Twitter. Yet another explanation
could be differences in network structure on the two sites,
or simply an artifact of the biases introduced by our
aggressive spam filtering or the small size of the data set. We
are addressing these questions in ongoing work.



V. RELATED WORK 

The interplay between structural properties of networks
and the diffusion processes occurring on them
contributes to their complexity. This has been recognized by
several researchers in the past. For example, Lambiotte
et al. [3, 48] emphasized that dynamical processes play an
important role in characterizing the structure of complex
networks. In [48] they measure the quality of a network
partition in terms of the statistical properties of the
dynamic process taking place in the network. In [3] they
study the different equilibrium properties of these
processes. However, their works focus on what we call
conservative processes: unbiased and biased random walks,
discrete and continuous time random walks. In contrast,
we also study non-conservative dynamical processes. We
also relate these processes to centrality. Although the
relationship of PageRank to random walk-type processes is
well known, we explain how Alpha-Centrality is related
to a type of non-conservative process. We also carry
out an empirical study of different centrality measures,
unlike previous works.

Non-conservative processes are useful for studying a 
wide range of social phenomena, including the spread 
of epidemics within a population and information dif- 
fusion in social media, viral marketing, and many oth- 
ers. Many of these phenomena have been investigated 
by other researchers. The study of epidemics, in particular,
has a very long history [TT]. It is known, for
example, that epidemics exhibit critical behavior, and
that the threshold of critical behavior is related to
network structure pTSVEU]. The present work further confirms
the relationship between epidemic threshold and
network structure. Moreover, it gives an intuitive ex- 
planation for critical behavior of epidemics in terms of 
diverging length scales of non-conservative interactions. 

Borgatti [1] suggested a link between centrality and dy- 
namical processes, defining a node's centrality in terms 
of its participation in the flow taking place on the net- 
work [2]. Therefore, he claimed, the appropriate centra- 
lity measure for a given network is one that takes into 
account the details of the flow. He proposed a typology
of flows, based on the trajectories they follow (e.g.,
geodesics, paths, trails) and the mechanism of spread
(e.g., transfer or broadcast), and used simulations to
explore the relationship between flows and centrality
measures, such as betweenness, degree, and eigenvector [30]
centralities. He showed that a centrality whose assumptions
matched the details of the flow was better able to
reproduce key observations, such as how quickly or how
frequently the flow reached a node. For example, a flow
that follows geodesics (shortest paths) frequently visits
nodes with the highest betweenness centrality [29]. We
propose a simpler classification scheme that differentiates
flows based on whether or not they conserve the flowing 
quantity. Unlike Borgatti's work, we mathematically ex- 
plore the relationship between different flows and centra- 
lity and empirically study differences between centrality 
measures. 

Estrada et al. [32] studied measures similar to Alpha- 
Centrality and personalized PageRank (with attenuation 
factor 1) which they call communicability. They linked 
the communicability functions to dynamics by showing 
their relationship to the thermal Green's function of os- 
cillators. They used communicability to identify impor- 
tant actors in small social networks, demonstrating that 
different communicability functions led to different judge- 
ments of centrality, but did not justify the choice of the 
particular communicability function in terms of the inter- 
actions taking place between actors. Although we study 
a similar function, the goal of our work is to contrast 
conservative and non-conservative dynamics and explain 
how these differences should guide the choice of centrality 
measure for a given social network. 



Researchers are increasingly turning to social media 
data sets to study the properties of complex networks. 
Some studies used activity-based measures, such as the 
number of mentions or retweets [37-40], to identify
important social media users. Besides correlating these
activity-based measures with degree centrality [37], no
study has investigated centrality in social media. Our 
focus in this paper is to justify the choice of centrality by 
taking into account the dynamical processes taking place 
on the network. 



VI. CONCLUSION 

We described two fundamentally distinct dynamical 
processes on networks, which can be differentiated based 
on whether or not they conserve some quantity that is 
distributed on the network, and studied their relation- 
ship to two well-known centrality measures used to iden- 
tify important or influential actors in a social network: 
PageRank and Alpha-Centrality. While PageRank rep- 
resents a steady state distribution of a conservative dy- 
namic process on a network, we showed that Alpha- 
Centrality is a solution to a non-conservative process, 
examples of which are epidemics and information dif- 
fusion. By analyzing data about information diffusion 
in social media, we found that Alpha-Centrality tends 
to better correlate with the empirical measures of influ- 
ence than PageRank, since it takes into account the non- 
conservative nature of information diffusion. Centrality 
is but one type of measurement of network structure. 
Other types of measurements, for instance, community 
detection, may also be affected by the nature of the dynamic
processes occurring on networks and will be addressed
in future work.
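The distinction between the two kinds of processes can be illustrated with a single update step on a toy graph. The sketch below is only an illustration: the 4-node graph, the function names and the value of alpha are hypothetical, and the broadcast step is a bare attenuated replication step, not the normalized Alpha-Centrality computation.

```python
def random_walk_step(w, adj):
    """Conservative step: each node splits its weight equally among its
    out-neighbors, so the total weight on the network is preserved."""
    n = len(w)
    out = [0.0] * n
    for i in range(n):
        deg = sum(adj[i])
        for j in range(n):
            if adj[i][j]:
                out[j] += w[i] / deg
    return out


def broadcast_step(w, adj, alpha):
    """Non-conservative step: each node sends a full copy of its weight,
    attenuated by alpha, to every out-neighbor."""
    n = len(w)
    out = [0.0] * n
    for i in range(n):
        for j in range(n):
            if adj[i][j]:
                out[j] += alpha * w[i]
    return out


# Hypothetical directed graph; adj[i][j] = 1 means weight flows from i to j.
adj = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
w0 = [1.0, 1.0, 1.0, 1.0]

walk = random_walk_step(w0, adj)
spread = broadcast_step(w0, adj, alpha=0.5)
print(sum(w0), sum(walk), sum(spread))
```

The random walk step leaves the total weight at 4, while the broadcast step changes it to 3.5, because the number of copies sent depends on the number of out-links rather than on the weight available.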



Appendix 

The replication matrix $\mathcal{R}$ can be written in terms of its
eigenvalues and eigenvectors as:

$$\mathcal{R} = X \Lambda X^{-1} = \sum_{i=1}^{|V|} \lambda_i Y_i \tag{13}$$

where $X$ is a matrix whose columns are the eigenvectors
of $\mathcal{R}$. $\Lambda$ is a diagonal matrix, whose diagonal elements
are the eigenvalues, $\Lambda_{ii} = \lambda_i$, arranged according to the
ordering of the eigenvectors in $X$. Without loss of generality
we assume that $\lambda_1 > \lambda_2 > \cdots > \lambda_n$. The matrices
$Y_i$ can be determined from the product

$$Y_i = X Z_i X^{-1} \tag{14}$$

where $Z_i$ is the selection matrix having zeros everywhere
except for element $(Z_i)_{ii} = 1$ [49]. Therefore,



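$$S(\alpha, t) = \sum_{k=0}^{t} \alpha^k \mathcal{R}^k = \sum_{i=1}^{n} \frac{(-1)^{l_i}\left(1 - \alpha^{t+1}\lambda_i^{t+1}\right)}{(-1)^{l_i}\left(1 - \alpha\lambda_i\right)}\, Y_i \tag{15}$$

where $l_i = 0$ if $\alpha|\lambda_i| < 1$ and $l_i = 1$ if $\alpha|\lambda_i| > 1$. As is
obvious from the above, for Equation 15 to hold non-trivially,
$\alpha \neq 1/|\lambda_i|$ for all $i = 1, 2, \ldots, n$. Now assuming $|\lambda_1|$ is strictly
greater than the magnitude of any other eigenvalue,

$$S(\alpha, t) \approx \frac{(-1)^{l_1}\left(1 - \alpha^{t+1}\lambda_1^{t+1}\right)}{(-1)^{l_1}\left(1 - \alpha\lambda_1\right)}\, Y_1$$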



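For any matrix $M$, let $\|M\|_1 = \sum_{i,j} |M_{ij}|$. Therefore,
the expected number of paths is $\|S(\alpha, t)\|_1$. The
expected path length is given by:

$$\frac{\sum_{k=0}^{t} k\,\alpha^k \|\mathcal{R}^k\|_1}{\sum_{k=0}^{t} \alpha^k \|\mathcal{R}^k\|_1} = \frac{\alpha\,\dfrac{d\|S(\alpha,t)\|_1}{d\alpha}}{\|S(\alpha,t)\|_1} \approx \frac{\alpha\lambda_1}{1 - \alpha\lambda_1} - \frac{(t+1)\,\alpha^{t+1}\lambda_1^{t+1}}{1 - \alpha^{t+1}\lambda_1^{t+1}}$$

Therefore, as $t \to \infty$ and $\alpha|\lambda_1| < 1$, the expected path
length is approximately $\alpha\lambda_1/(1 - \alpha\lambda_1)$, and for $\alpha|\lambda_1| > 1$ it is
$O(t)$.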
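The two regimes of the expected path length can be checked numerically on a small graph. The sketch below is an illustration under stated assumptions: the graph (a triangle with one pendant node) is hypothetical, the largest eigenvalue is estimated by power iteration, and the closed form used for comparison, αλ_1/(1 − αλ_1), is what differentiating the leading-eigenvalue term of the sum gives when t is large and α|λ_1| < 1. Powers of αR are accumulated directly so that the entries stay bounded below the threshold.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def entry_sum(m):
    """Entry-wise 1-norm: the sum of absolute values of all entries."""
    return sum(abs(x) for row in m for x in row)


def largest_eigenvalue(r, iters=1000):
    """Power iteration; assumes a non-negative matrix whose largest
    eigenvalue is strictly dominant in magnitude."""
    n = len(r)
    v = [1.0] * n
    lam = 0.0
    for _ in range(iters):
        w = [sum(r[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = max(abs(x) for x in w)
        v = [x / lam for x in w]
    return lam


def expected_path_length(r, alpha, t):
    """Ratio of sum_k k*alpha^k*||R^k||_1 to sum_k alpha^k*||R^k||_1, k = 0..t."""
    n = len(r)
    scaled = [[alpha * x for x in row] for row in r]
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    num = den = 0.0
    for k in range(t + 1):
        weight = entry_sum(power)  # equals alpha^k * ||R^k||_1
        num += k * weight
        den += weight
        power = mat_mul(power, scaled)
    return num / den


# Hypothetical undirected graph: triangle (0, 1, 2) plus pendant node 3.
R = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
lam1 = largest_eigenvalue(R)
alpha = 0.99 / lam1  # just below the threshold 1 / lam1
observed = expected_path_length(R, alpha, t=5000)
predicted = alpha * lam1 / (1 - alpha * lam1)
print(observed, predicted)
```

Close to the threshold the two values agree to within a few percent; further below it the remaining eigenvalues contribute and the leading-term approximation becomes rougher. Above the threshold, e.g. with alpha = 1.5 / lam1, the same function returns a value that grows in proportion to t.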